[WIP] Improve multimodal processors - rely less on kwargs #28711

molbap · 2024-01-25T17:42:35Z

What does this PR do?

This PR aims to give better control over the logic flow through Processor classes, in particular those that combine an ImageProcessor with a Tokenizer. Linked with #27768.

Compared to Nougat (used as a reference point), ImageProcessors have different signatures in their preprocess method. The arguments that differ are listed here:

TvltImageProcessor:
videos, patch_size, crop_size, do_center_crop, is_mixed, num_frames

IdeficsImageProcessor:
transform, image_num_channels, image_size

ViTImageProcessor:
No difference in args

Mask2FormerImageProcessor:
segmentation_maps, ignore_index, size_divisor, reduce_labels, instance_id_to_semantic_id

MaskFormerImageProcessor:
segmentation_maps, ignore_index, size_divisor, do_reduce_labels, instance_id_to_semantic_id

YolosImageProcessor:
format, return_segmentation_masks, annotations, masks_path

MobileNetV1ImageProcessor:
do_center_crop, crop_size

DeiTImageProcessor:
do_center_crop, crop_size

EfficientNetImageProcessor:
include_top, do_center_crop, rescale_offset, crop_size

BeitImageProcessor:
do_reduce_labels, do_center_crop, segmentation_maps, crop_size

MobileViTImageProcessor:
do_flip_channel_order, do_center_crop, segmentation_maps, crop_size

PerceiverImageProcessor:
do_center_crop, crop_size

DeformableDetrImageProcessor:
format, return_segmentation_masks, annotations, masks_path

EfficientFormerImageProcessor:
do_center_crop, crop_size

SegformerImageProcessor:
do_reduce_labels, segmentation_maps

LayoutLMv2ImageProcessor:
apply_ocr, ocr_lang, tesseract_config

BridgeTowerImageProcessor:
do_center_crop, size_divisor

SamImageProcessor:
segmentation_maps, pad_size, do_convert_rgb, mask_pad_size, mask_size

BlipImageProcessor:
do_convert_rgb

Owlv2ImageProcessor:
No difference in args

LayoutLMv3ImageProcessor:
apply_ocr, ocr_lang, tesseract_config

DetaImageProcessor:
format, return_segmentation_masks, annotations, masks_path

BitImageProcessor:
do_center_crop, do_convert_rgb, crop_size

ViTHybridImageProcessor:
do_center_crop, do_convert_rgb, crop_size

FuyuImageProcessor:
patch_size, padding_mode, padding_value

PvtImageProcessor:
No difference in args

Pix2StructImageProcessor:
max_patches, header_text, do_convert_rgb, patch_size

VitMatteImageProcessor:
trimaps, size_divisibility

VideoMAEImageProcessor:
videos, do_center_crop, crop_size

MobileNetV2ImageProcessor:
do_center_crop, crop_size

OneFormerImageProcessor:
segmentation_maps, ignore_index, task_inputs, do_reduce_labels, instance_id_to_semantic_id

FlavaImageProcessor:
crop_size, codebook_crop_size, codebook_rescale_factor, mask_group_max_patches, mask_group_min_patches, mask_group_max_aspect_ratio, codebook_image_mean, codebook_do_resize, return_image_mask, input_size_patches, codebook_do_center_crop, codebook_resample, mask_group_min_aspect_ratio, codebook_do_normalize, codebook_do_map_pixels, return_codebook_pixels, codebook_image_std, do_center_crop, codebook_size, codebook_do_rescale, total_mask_patches

DonutImageProcessor:
random_padding

TvpImageProcessor:
videos, crop_size, constant_values, do_flip_channel_order, do_center_crop, pad_size, pad_mode

GLPNImageProcessor:
size_divisor

PoolFormerImageProcessor:
crop_pct, do_center_crop, crop_size

CLIPImageProcessor:
do_center_crop, do_convert_rgb, crop_size

DPTImageProcessor:
ensure_multiple_of, keep_aspect_ratio, size_divisor

ViltImageProcessor:
size_divisor

Swin2SRImageProcessor:
pad_size

ImageGPTImageProcessor:
clusters, do_color_quantize

SiglipImageProcessor:
No difference in args

VivitImageProcessor:
videos, do_center_crop, offset, crop_size

ConvNextImageProcessor:
crop_pct

OwlViTImageProcessor:
do_center_crop, crop_size

ChineseCLIPImageProcessor:
do_center_crop, do_convert_rgb, crop_size

LevitImageProcessor:
do_center_crop, crop_size

ConditionalDetrImageProcessor:
format, return_segmentation_masks, annotations, masks_path

DetrImageProcessor:
format, return_segmentation_masks, annotations, masks_path

Listing these differences helps standardize the ImageProcessors first, which will then allow making the Processors uniform.

Fixes # (issue)

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

Models:

text models: @ArthurZucker and @younesbelkada
vision models: @amyeroberts

HuggingFaceDocBuilderDev · 2024-01-25T18:04:54Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

amyeroberts

Nice - looking good!

Let me know when you want another review 🤗

amyeroberts · 2024-02-06T15:22:20Z

tests/models/bridgetower/test_image_processing_bridgetower.py

@@ -39,7 +39,7 @@ def __init__(
        do_rescale: bool = True,
        rescale_factor: Union[int, float] = 1 / 255,
        do_normalize: bool = True,
-        do_center_crop: bool = True,


What's the reason for changing the default here? I think BridgeTowerImageProcessor defaults to this being True, so would have this value, even if not passed here

Yes, was conflicted on this... this is the current version of preprocess for bridgetower in main, missing a new variable declaration:

transformers/src/transformers/models/bridgetower/image_processing_bridgetower.py

Line 448 in 115ac94

do_center_crop if do_center_crop is not None else self.do_center_crop

In that case, the do_center_crop that was used in preprocess was the preprocess default, i.e. None instead of whatever was in the __init__, right?

@amyeroberts on that, bridgetower did default do_center_crop to being True, but preprocess was not capturing it (it was a bug). The get_expected_values() method does not mention center_crop and will create expected uncropped values. From that, there are two options:

Either change the ...expected_values... getter to something that includes cropping in its logic,

or change the default of the tester to match the previous behaviour.
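
To make the bug described above concrete, here is a minimal sketch of the fallback pattern quoted from line 448. `ToyImageProcessor` and its two methods are hypothetical stand-ins and not the actual BridgeTower code; they only contrast a preprocess that ignores the `__init__` value with one that falls back to it.

```python
from typing import Optional


class ToyImageProcessor:
    """Hypothetical stand-in, not the real BridgeTowerImageProcessor."""

    def __init__(self, do_center_crop: bool = True):
        self.do_center_crop = do_center_crop

    def preprocess_buggy(self, do_center_crop: Optional[bool] = None) -> bool:
        # No fallback: the call-time default (None) is used as-is,
        # so the value set in __init__ is silently ignored.
        return bool(do_center_crop)

    def preprocess_fixed(self, do_center_crop: Optional[bool] = None) -> bool:
        # Fall back to the instance attribute when the caller passes nothing.
        do_center_crop = do_center_crop if do_center_crop is not None else self.do_center_crop
        return do_center_crop


processor = ToyImageProcessor(do_center_crop=True)
buggy_result = processor.preprocess_buggy()
fixed_result = processor.preprocess_fixed()
explicit_result = processor.preprocess_fixed(do_center_crop=False)
```

With the fallback in place, a tester that builds uncropped expected values no longer matches unless it either accounts for cropping or sets do_center_crop to False, which are the two options listed above.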

amyeroberts · 2024-02-06T15:23:19Z

src/transformers/models/donut/processing_donut.py

-        if len(args) > 0:
-            images = args[0]
-            args = args[1:]


It's very nice to see this go :)

amyeroberts · 2024-02-06T15:24:11Z

src/transformers/models/bridgetower/processing_bridgetower.py

        )
        # add pixel_values + pixel_mask
+        print(size)


Suggested change

print(size)

amyeroberts · 2024-02-06T15:34:30Z

src/transformers/models/bridgetower/image_processing_bridgetower.py

+        valid_processor_keys = {
+            "images",
+            "do_resize",
+            "size",
+            "size_divisor",
+            "resample",
+            "do_rescale",
+            "rescale_factor",
+            "do_normalize",
+            "image_mean",
+            "image_std",
+            "do_pad",
+            "do_center_crop",
+            "return_tensors",
+            "data_format",
+            "input_data_format",
+            "pad_and_return_pixel_mask",
+        }
+
+        unused_keys = set(kwargs.keys()) - valid_processor_keys
+        if unused_keys:
+            unused_key_str = ", ".join(unused_keys)
+            logger.info(f"Unused or unrecognized configuration parameters: {unused_key_str}.")


Two comments:

What's the reason for this being added for some image processors but not others e.g. donut also accepts kwargs?

Could we abstract out this check to something similar with Abstract image processor arg checks. #28843 using either inspect or an explicit class attribute?

Good points!

no reason at all - all for adding a similar logic to all!

inspect is a bit slow, right? But yes, also +1 to abstracting this away, I'll move it to another PR

@amyeroberts was thinking about decorators: wdyt about having this as a wrapper/decorator in image_utils or elsewhere in utils?

valid_processor_keys = set(inspect.getfullargspec(self.preprocess)[0])
unused_keys = set(kwargs.keys()) - valid_processor_keys
if unused_keys:
    unused_key_str = ", ".join(unused_keys)
    logger.info(f"Unused or unrecognized configuration parameters: {unused_key_str}.")

LGTM!
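
For reference, the decorator idea floated above could look roughly like the following. This is only a sketch under assumptions: `named_parameters`, `log_unused_kwargs`, and `ToyImageProcessor` are hypothetical names, and nothing here is claimed to exist in image_utils. Resolving the signature once at decoration time is one way to address the concern that inspect is slow.

```python
import functools
import inspect
import logging

logger = logging.getLogger(__name__)


def named_parameters(func):
    """Hypothetical helper: names of all parameters of `func` except **kwargs."""
    return {
        name
        for name, param in inspect.signature(func).parameters.items()
        if param.kind is not inspect.Parameter.VAR_KEYWORD
    }


def log_unused_kwargs(func):
    """Hypothetical decorator: log kwargs that are not named parameters of `func`."""
    # Computed once per decorated function, not on every call.
    valid_keys = named_parameters(func)

    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        unused_keys = sorted(set(kwargs) - valid_keys)
        if unused_keys:
            logger.info(f"Unused or unrecognized configuration parameters: {', '.join(unused_keys)}.")
        return func(self, *args, **kwargs)

    return wrapper


class ToyImageProcessor:
    @log_unused_kwargs
    def preprocess(self, images, do_resize=True, **kwargs):
        return {"images": images, "do_resize": do_resize}


result = ToyImageProcessor().preprocess([1, 2], do_resize=False, not_a_real_option=3)
```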

Perhaps instead of inspecting, we could have a class attribute with the valid_processor_keys? e.g. something like this:

class FooImageProcessor:
    _valid_processor_keys = ['bar', 'baz']
    ...
    def preprocess(..., **kwargs):
        validate_kwargs(self._valid_processor_keys, kwargs)

Having a class attribute like this is something I think I'm going to end up with in the #28847 design

sounds good yes! self-contained classes seem worth losing the decorator
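
A runnable version of the class-attribute approach agreed on above might look like this. The body of `validate_kwargs` is an assumption written for illustration (the thread only gives its call site), and the key list on `FooImageProcessor` is made up.

```python
import logging
from typing import Any, Dict, List

logger = logging.getLogger(__name__)


def validate_kwargs(valid_processor_keys: List[str], captured_kwargs: Dict[str, Any]) -> List[str]:
    """Hypothetical helper: log and return the kwargs missing from the explicit list."""
    unused_keys = sorted(set(captured_kwargs) - set(valid_processor_keys))
    if unused_keys:
        logger.info(f"Unused or unrecognized configuration parameters: {', '.join(unused_keys)}.")
    return unused_keys


class FooImageProcessor:
    # Explicit list kept next to the class, so no introspection is needed.
    _valid_processor_keys = ["images", "bar", "baz"]

    def preprocess(self, images, bar=None, baz=None, **kwargs):
        validate_kwargs(self._valid_processor_keys, kwargs)
        return images


images = FooImageProcessor().preprocess([0], bar=1, qux=2)
unused = validate_kwargs(FooImageProcessor._valid_processor_keys, {"bar": 1, "qux": 2})
```

The trade-off is that the list has to be kept in sync with the preprocess signature by hand, in exchange for a self-contained class with no decorator or inspect call.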

amyeroberts · 2024-02-06T15:36:25Z

src/transformers/models/bridgetower/image_processing_bridgetower.py

        **kwargs,
    ) -> None:
-        if "pad_and_return_pixel_mask" in kwargs:
-            do_pad = kwargs.pop("pad_and_return_pixel_mask")
+        if pad_and_return_pixel_mask:


Let's set the default value for size in the init to avoid having mutables as default arguments

Suggested change

if pad_and_return_pixel_mask:

size = {"shortest_edge": 288} if size is None else size

if pad_and_return_pixel_mask:
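
The reason for the suggestion above is standard Python behaviour: a mutable default argument is created once, at function definition time, and shared between calls. A small sketch with hypothetical function names (the `touched` key exists only to make the sharing visible):

```python
from typing import Dict, Optional


def init_with_mutable_default(size: Dict[str, int] = {"shortest_edge": 288}) -> Dict[str, int]:
    # The same dict object is reused across calls, so this mutation leaks.
    size["touched"] = size.get("touched", 0) + 1
    return size


def init_with_none_default(size: Optional[Dict[str, int]] = None) -> Dict[str, int]:
    # A fresh dict is built on every call where the caller passes nothing.
    size = {"shortest_edge": 288} if size is None else size
    size["touched"] = size.get("touched", 0) + 1
    return size


shared_first = init_with_mutable_default()
shared_second = init_with_mutable_default()
fresh_first = init_with_none_default()
fresh_second = init_with_none_default()
```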

amyeroberts · 2024-02-06T15:41:54Z

src/transformers/models/blip/processing_blip.py


        if text is not None:
+            self.current_processor = self.tokenizer


Current processor behaviour is deprecated, so we don't need to set it here. In fact, we should probably create a current_processor property which shows a deprecation message when used

Suggested change

self.current_processor = self.tokenizer

Okay, that's good to know! Seen it in another instance I think, I'll drop it in that case and add the message

Yes, it's definitely still around in places. For context, we used to have a behaviour where the current_processor was selected through a context manager. The context manager behaviour was removed, but there are still remnants of it, even though current_processor normally has no effect.
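
As a rough illustration of the property suggested above, the sketch below emits a deprecation warning whenever current_processor is read. `ToyProcessor`, the warning text, the choice of FutureWarning, and returning the image processor are all assumptions made for the example, not the library's actual implementation.

```python
import warnings


class ToyProcessor:
    """Hypothetical processor, used only to illustrate the deprecation idea."""

    def __init__(self, image_processor, tokenizer):
        self.image_processor = image_processor
        self.tokenizer = tokenizer

    @property
    def current_processor(self):
        # Assumed message and return value; reading the attribute still works,
        # but the caller is told it no longer has any effect.
        warnings.warn(
            "`current_processor` is deprecated and has no effect; "
            "use `image_processor` or `tokenizer` directly.",
            FutureWarning,
        )
        return self.image_processor


processor = ToyProcessor(image_processor="image-processor", tokenizer="tokenizer")
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    value = processor.current_processor
```

Because the property has no setter, an assignment such as `self.current_processor = self.tokenizer` would raise an AttributeError on this class, which also makes leftover assignments easy to find.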

amyeroberts · 2024-02-06T15:42:52Z

src/transformers/models/blip_2/processing_blip_2.py


        if text is not None:
+            self.current_processor = self.tokenizer


Same comment here about current processors

amyeroberts · 2024-02-06T16:28:53Z

src/transformers/models/bridgetower/image_processing_bridgetower.py

+        do_center_crop: Optional[bool] = True,
+        do_pad: Optional[bool] = True,


These shouldn't be optional in the init

Suggested change

do_center_crop: Optional[bool] = True,

do_pad: Optional[bool] = True,

do_center_crop: bool = True,

do_pad: bool = True,

absolutely, fixed in #28843 which should be merged before this one

molbap added 9 commits January 25, 2024 11:02

expand kwargs from align

42ecf48

remove kwargs from altclip processor

ccb2147

add explicit args for donut processor

f999e0c

add explicit call to current processor for in context manager

8fb3a6b

format

a90c766

remove unused kwargs

49cb6cc

move conditions for encodings

3ac1c7e

improve flow over text/image

7a819fd

[breaking] pass explicit args to bridgetower

9cc38b7

molbap added 13 commits January 26, 2024 12:44

wwsMerge branch 'main' into improve_multimodal_processors

ff6a950

add default kwargs for BC

7db64a0

fix bridgetower

41674d9

debug bridgetower image proc

618a687

format

f39cdc1

move kwargs message to info level

9a6f97d

add debug messages

380f82f

fix arguments not being passed in bridgetower

75f15d3

keep backwards compat for processing + modify testing args dict

3df5faa

Merge branch 'main' into improve_multimodal_processors

5ad0694

fix quality

69e5a2d

log kwargs mismatch to info level

68c2f40

fix quality

e1e4084

younesbelkada mentioned this pull request Feb 2, 2024

add test marker to run all tests with @require_bitsandbytes #28278

Merged

5 tasks

molbap mentioned this pull request Feb 2, 2024

Abstract image processor arg checks. #28843

Merged

amyeroberts reviewed Feb 6, 2024

molbap added 4 commits February 15, 2024 14:53

Merge branch 'main' into improve_multimodal_processors

bfa81e5

address comments

4b557b0

fix typo

b7fc377

fix expected tests for bridgetower

270bb9e

molbap mentioned this pull request Feb 16, 2024

Raise unused kwargs image processor #29063

Merged

1 task

molbap added 5 commits February 26, 2024 09:42

fix conflicts

94a1b75

Merge branch 'main' into improve_multimodal_processors

6603bf0

fix valid processor keys

004c961

remove unused arg list

c2e49f5

quality

79958b5

huggingface deleted a comment from github-actions bot Mar 22, 2024

huggingface deleted a comment from github-actions bot Apr 16, 2024

molbap added 5 commits April 17, 2024 12:14

Merge branch 'main' into improve_multimodal_processors

a36f524

skeleton draft - uniform processor call

3238dd3

fix quality

3afde22

add broken wav2vec audio processing

eb99e29

Merge branch 'main' into improve_multimodal_processors

c6afd63

molbap mentioned this pull request Apr 26, 2024

Image + text + audio uniform processors #30511

Open

12 tasks
